Multifractal analysis of maize and soybean DNA

This paper investigates the complexity of DNA sequences in maize and soybean using the multifractal detrended fluctuation analysis (MF-DFA) method, chaos game representation (CGR), and the complexity-entropy plane approach. The study aims to understand the patterns and structures of these DNA sequences, which can provide insights into their genetic makeup and improve crop yield and quality. The results show that maize and soybean DNA sequences exhibit fractal properties, indicating a complex and self-organizing structure. We observe the persistence trend between sequences of base pairs, which indicates long-range correlations between base pairs. We also identified the stochastic nature of the DNA sequences of both species.

and the complexity of the time series, providing insights into the structure and dynamics of the system under investigation 31 .
Due to the importance of maize and soybean not only for the world economy but also for the planet's food security, in this study we investigate the properties of the DNA sequences of these commodities, using the multifractal detrended fluctuation analysis (MF-DFA) method, chaos game representation, and the complexity-entropy plane to analyze the scaling behaviour and determine the fractality of nucleotide sequences. For this, we use the database available on the NCBI website 32. We define a function to transform the sequence of base pairs {A, C, G, T} into a time series. Our results indicate that both species exhibit fractal behaviour along DNA sequences and power-law correlations between base pairs. The time series generated by DNA sequences present high persistence and stochastic behaviour, which implies that they have long-term memory and a tendency to remain close to their past values, while their short-term fluctuations are random.

Chaos game representation
The chaos game representation (CGR) technique is a generalized Markov chain and allows a unique representation of a nucleotide sequence. A mapping rule that transforms a sequence into a two-dimensional picture can reveal fractal structures and has shown promise in recognizing underlying local and global patterns or nucleotide selection bias in gene sequences 23,24. Mathematically, the chaos game representation is described by an iterated function system in which each new base pair yields a set of coordinates (p, q). The algorithm follows these steps 22-24,33:
1. The nucleotides "A", "T/U", "G" and "C" are positioned at the vertices of a square centered at the origin (0, 0). We denote the locations of the vertices by $V_A = (-1, 1)$, $V_C = (-1, -1)$, $V_G = (1, 1)$ and $V_T = (1, -1)$, corresponding to the bases A, C, G and T, respectively.
2. Given a sequence of base pairs, the first point of the representation is placed at the midpoint between the center of the square and the vertex indicated by the first nucleotide.
3. The second point is placed at the midpoint between the position of the first nucleotide and the vertex of the square indicated by the second nucleotide.
4. The position of each subsequent nucleotide is obtained as the midpoint between the position of the previous nucleotide and the vertex corresponding to the current nucleotide.
Mathematically, the positions $(p, q)_{i+1}$ are obtained by the recurrence relation
$$(p, q)_{i+1} = \frac{1}{2}\left[(p, q)_i + V_j\right], \tag{1}$$
where $j \in \{A, C, T, G\}$ and we start from the center of the square, $(p, q)_0 = (0, 0)$. In this representation, each point in the CGR corresponds precisely to a subsequence (starting from the first base), and the entire subsequence of nucleotides up to the current nucleotide can be reconstructed just by knowing the corresponding point in the CGR 34.
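The midpoint rule described above can be sketched in a few lines of Python. This is only an illustration of the recurrence, not the code of Ref. 33; the function name `cgr_points` and the choice to let "U" share the T vertex are assumptions made for this example.

```python
# Vertex coordinates as given in the text (square centred at the origin).
VERTICES = {"A": (-1.0, 1.0), "C": (-1.0, -1.0), "G": (1.0, 1.0), "T": (1.0, -1.0)}

def cgr_points(sequence):
    """Return the CGR coordinates (p, q), one point per nucleotide.

    Hypothetical helper: symbols other than A, C, G, T/U raise a KeyError.
    """
    p, q = 0.0, 0.0  # start at the centre of the square
    points = []
    for base in sequence.upper():
        base = "T" if base == "U" else base  # assumption: U uses the T vertex
        vp, vq = VERTICES[base]
        # Midpoint between the previous point and the vertex of the current base.
        p, q = (p + vp) / 2.0, (q + vq) / 2.0
        points.append((p, q))
    return points

print(cgr_points("ACGT"))
```

Because every step halves the distance to a vertex, the last point of a sequence encodes the whole history of bases, which is why a CGR point identifies its subsequence.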
An essential application of the CGR is to assess the abundance of k-mers in a series of nucleotides 34. A k-mer is a subsequence of k bases. This approach takes advantage of the uneven distribution of subsequences of length k (k = 1, 2, 3, ...) along the nucleotide chain. For example, if we have the DNA sequence "ATCGATCGA" and set k = 3, then the 3-mers are: ATC, TCG, CGA, GAT, ATC, TCG, CGA. The frequency CGR (FCGR) algorithm generates a square divided by a grid into subquadrants, which makes it possible to represent the frequency of these 3-mers in an image. In Fig. 1 we show the two-dimensional image generated by the FCGR algorithm. The image is a square in which each subquadrant is a pixel assigned to a given k-mer. To each subquadrant we associate a gray level that corresponds to the frequency of occurrence of that k-mer in the sequence.
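As a small illustration of the k-mer count in the example above, the following sketch slides a window of length k over the sequence (written without spaces). The function name `kmer_counts` is hypothetical and is not part of any library used in the paper.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping subsequence of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = kmer_counts("ATCGATCGA", 3)
print(counts)  # ATC, TCG and CGA occur twice, GAT once
```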
Let us consider, for example, the artificial sequence "ACGT". In this case, each k-mer of length k = 1 must belong to one of the four quadrants shown in the middle row, left column of Fig. 1. We get one point in each quadrant because the four letters in "ACGT" are all different. FCGR counts the occurrences of monomers in each quadrant and assigns a relative grayscale value: the greater the number of occurrences (the frequency), the darker the quadrant, and vice versa. Therefore, for the string "ACGT", all four quadrants have the same gray level. For a different sequence, like "TTCA", we have two points in the T quadrant, one point in the C quadrant, one point in the A quadrant, and no points in the G quadrant. Thus, the gray level of quadrant T is twice as high as that of quadrants C and A, while quadrant G is white, as shown in the bottom row, left column of Fig. 1.
The representation for k = 3, in the right column of Fig. 1, corresponds to the 3-mers ACG and CGT for the sequence "ACGT", and TTC and TCA for the sequence "TTCA". Within each sequence, these 3-mers have the same gray level, as they occur with the same frequency, and the remaining pixels are blank, as the 3-mers they represent do not occur in the sequence.
In the same way, we count the frequency of 2-mers in the chains. The middle column of Fig. 1 shows the occurrence of the 2-mers for the sequences "ACGT" and "TTCA". The 2-mers in each of these sequences have the same shade of gray, since they appear with the same frequency, and the other quadrants appear in white, since the 2-mers they represent do not occur. The lower two rows of the right column of Fig. 1 show the 3-mer representation of the "ACGT" and "TTCA" sequences.
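The FCGR construction discussed above can be sketched as a grid of relative k-mer frequencies. In this sketch the pixel of a k-mer is found by running the CGR midpoint rule over its k letters starting from the center. The function names and the pixel orientation (row 0 at the top, column 0 on the left, following the vertex coordinates of the text) are assumptions of this example and need not match the layout of Fig. 1 or the code of Ref. 33.

```python
from collections import Counter

VERTICES = {"A": (-1.0, 1.0), "C": (-1.0, -1.0), "G": (1.0, 1.0), "T": (1.0, -1.0)}

def kmer_cell(kmer):
    """Row and column of the pixel that holds a given k-mer (hypothetical helper)."""
    p, q = 0.0, 0.0
    for base in kmer:
        vp, vq = VERTICES[base]
        p, q = (p + vp) / 2.0, (q + vq) / 2.0
    n = 2 ** len(kmer)              # pixels per side of the grid
    col = int((p + 1.0) / 2.0 * n)  # left to right
    row = int((1.0 - q) / 2.0 * n)  # top to bottom
    return row, col

def fcgr(sequence, k):
    """Return a 2^k x 2^k grid of relative k-mer frequencies."""
    n = 2 ** k
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    grid = [[0.0] * n for _ in range(n)]
    for kmer, count in counts.items():
        row, col = kmer_cell(kmer)
        grid[row][col] = count / total
    return grid

print(fcgr("TTCA", 1))  # T pixel holds 0.5, C and A hold 0.25, G holds 0.0
```

A gray-level image then follows by mapping each frequency to a shade, with darker pixels for larger values.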

Global distance
Ref. 23 proposes using the FCGR to determine the dissimilarity between DNA sequences through global distances between sequences on a given scale. For this, we calculate the global distance d between two FCGRs based on a weighted Pearson correlation coefficient, $rw_{p,q}$, where p and q are the coordinates of the quadrants in the FCGRs, each containing the occurrence of the same oligomeric sequences of length k. The modification of Pearson's standard definition consists of weighting the variance with the frequency nw to determine the correlation between the two sets of quadrants. The advantage of using this coefficient definition is

Time series
The four nitrogenous bases that comprise DNA are represented by the letters {A, C, G, T} (adenine, cytosine, guanine, and thymine, respectively). We create a function f that maps the four nitrogenous bases that make up the DNA sequence into four distinct values.
Consequently, we have a sequence of values $\{x_k : k = 1, 2, \ldots, N\}$ with $x_k \in \{\pm 1, \pm 2\}$. To build our time series x(t), we perform a cumulative sum of the values of $x_k$. Each value of the cumulative sum corresponds to a temporal measurement t.
A similar definition was used in Refs. 34,35 and aimed to distinguish purines (A and G) from pyrimidines (C, T, U).
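The construction of the time series can be sketched as follows. The text only fixes the set of values {±1, ±2} and, later, that A and G push the walk upward. The specific assignment A = +1, G = +2, C = −1, T = −2 used here is an assumption for illustration, as is the function name `dna_walk`.

```python
from itertools import accumulate

# Hypothetical assignment: purines positive, pyrimidines negative.
# The exact values used in the paper are not recoverable from the text.
BASE_VALUES = {"A": 1, "G": 2, "C": -1, "T": -2}

def dna_walk(sequence):
    """Map each base to a number x_k and return the cumulative sum x(t)."""
    steps = [BASE_VALUES[base] for base in sequence.upper()]
    return list(accumulate(steps))

print(dna_walk("ACGT"))  # [1, 0, 2, 0] under the assumed mapping
```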

Ordinal patterns
Ordinal pattern methods involve mapping a time series to a sequence of patterns or ranks, where each pattern reflects the order of values in a given window. This mapping enables the study of complex systems by computing various metrics, including permutation entropy and the complexity-entropy plane 25,36-38. In 2002, Bandt and Pompe introduced these methods as a simple, robust, and computationally efficient way to measure complexity in time series data 31. Permutation entropy is defined as the Shannon entropy of a probability distribution associated with ordinal patterns evaluated from partitions of a time series, a process known as the Bandt-Pompe symbolization approach. Let $\{x(t) : t = 1, 2, \ldots, N\}$ be a time series with N observations. We divide the series into $n_x = N - (d_x - 1)\tau_x$ overlapping partitions, each composed of $d_x > 1$ elements separated by the embedding delay $\tau_x \geq 1$. For given $d_x$ and $\tau_x$, we obtain the set of partitions $w_p = (x_p, x_{p+\tau_x}, \ldots, x_{p+(d_x-1)\tau_x})$, where p is the index of the partition.
Next, we sort the elements of each partition in ascending order, i.e., for each partition $w_p$, we evaluate the permutation $\pi_p = (r_0, r_1, \ldots, r_{d_x-1})$ of the index numbers $(0, 1, \ldots, d_x - 1)$ that sorts the elements of $w_p$ in ascending order. The permutation of the index numbers is defined by the inequality $x_{p+r_0} \leq x_{p+r_1} \leq \cdots \leq x_{p+r_{d_x-1}}$ and, in case of equal values, we maintain the order of occurrence of the partition elements. After evaluating the permutation symbols associated with all data partitions, we obtain a symbolic sequence $\{\pi_p\}_{p=1,\ldots,n_x}$. For more details about this method, we recommend Refs. 25,38,39. The ordinal probability distribution $\{\rho_i(\Pi_i)\}_{i=1,\ldots,n_\pi}$ is the relative frequency of all possible permutations within the symbolic sequence, given by
$$\rho_i(\Pi_i) = \frac{\text{number of partitions of type } \Pi_i \text{ in } \{\pi_p\}}{n_x},$$
where i represents each of the $n_\pi = d_x!$ different ordinal patterns.
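A minimal sketch of the Bandt-Pompe symbolization described above is given below. It is not the ordpy implementation used later in the paper; the function name and its defaults are chosen here for illustration. Ties are resolved by order of occurrence, which Python's stable sort provides.

```python
from collections import Counter
from itertools import permutations

def ordinal_distribution(x, dx=3, taux=1):
    """Relative frequency of every ordinal pattern of order dx in the series x."""
    n_partitions = len(x) - (dx - 1) * taux
    if n_partitions <= 0:
        raise ValueError("time series too short for the chosen dx and taux")
    counts = Counter()
    for p in range(n_partitions):
        window = [x[p + j * taux] for j in range(dx)]
        # Indices that sort the window; the stable sort keeps the order of
        # occurrence when two values are equal.
        pattern = tuple(sorted(range(dx), key=lambda j: window[j]))
        counts[pattern] += 1
    # Report all dx! patterns, including those that never occur.
    return {pat: counts[pat] / n_partitions for pat in permutations(range(dx))}

dist = ordinal_distribution([4, 7, 9, 10, 6, 11, 3], dx=3, taux=1)
print(dist)
```

For the seven-point series in the usage line there are five partitions, of which two are increasing, two have the pattern (2, 0, 1) and one has the pattern (1, 0, 2).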
With the ordinal probability distribution, we can calculate the Shannon entropy of the permutations,
$$S(P) = -\sum_{i=1}^{n_\pi} \rho_i(\Pi_i) \log \rho_i(\Pi_i).$$
Entropy, in this context, refers to the degree of disorder or randomness in a time series. Specifically, permutation entropy is a measure of the unpredictability of the order of patterns in a time series, such that $S \approx \log n_\pi$ indicates randomness and $S \approx 0$ indicates more regular dynamics. Because the maximum value of S is $S_{\max} = \log n_\pi$, we can further define the normalized permutation entropy as
$$H(P) = \frac{S(P)}{\log n_\pi},$$
where the value of H is restricted to the interval [0, 1]. Another essential measure to characterize a series is complexity. In addition to Bandt and Pompe's symbolization approach, the complexity-entropy plane is a well-known technique for analyzing time series data 31. It offers a two-dimensional representation space based on the permutation entropy H and an intensive statistical complexity measure C. This approach, initially created to distinguish between chaotic and stochastic time series, has proven helpful in various situations, including pattern recognition and classification 27,28.
The statistical complexity measure used in this method was inspired by the work of López-Ruiz et al. 40 and is defined through the Jensen-Shannon divergence between the ordinal distribution $P = \{\rho_i(\Pi_i)\}_{i=1,\ldots,n_\pi}$ and the uniform distribution $U = \{1/n_\pi\}_{i=1,\ldots,n_\pi}$. Mathematically, we can write this complexity as
$$C(P) = \frac{D(P, U)\, H(P)}{D_{\max}},$$
where
$$D(P, U) = S\!\left(\frac{P + U}{2}\right) - \frac{S(P)}{2} - \frac{S(U)}{2}$$
is the Jensen-Shannon divergence and $D_{\max}$ is the normalization constant given by
$$D_{\max} = -\frac{1}{2}\left[\frac{n_\pi + 1}{n_\pi}\log(n_\pi + 1) + \log n_\pi - 2\log(2 n_\pi)\right].$$
The existence of nontrivial structures is quantified by complexity. The statistical complexity is C = 0 at both extremes of order (when only one permutation symbol occurs) and disorder (when all permutations are equally likely), whereas the permutation entropy distinguishes these two cases (H = 0 and H = 1, respectively). The value of C measures structural complexity and conveys extra details that the value of H does not. Furthermore, there is a range of possible values of C for a given value of H, making C a nontrivial function of H. A more detailed discussion of the meaning of the complexity C can be found in Ref. 40.
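To make the two coordinates of the complexity-entropy plane concrete, the following sketch computes H and C from a list of ordinal probabilities, using the standard Jensen-Shannon form of the statistical complexity. It is an illustration only, not the ordpy code; the function names are hypothetical, and the input is assumed to list the probabilities of all $n_\pi$ patterns, including zeros.

```python
import math

def shannon(probs):
    """Shannon entropy with the convention 0 * log(0) = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_complexity(probs):
    """Normalized permutation entropy H and statistical complexity C."""
    n = len(probs)                       # number of possible patterns, d_x!
    uniform = [1.0 / n] * n
    h = shannon(probs) / math.log(n)
    mixture = [(p + u) / 2.0 for p, u in zip(probs, uniform)]
    # Jensen-Shannon divergence between the ordinal and uniform distributions.
    js = shannon(mixture) - shannon(probs) / 2.0 - shannon(uniform) / 2.0
    # Largest possible divergence, reached when a single pattern has probability 1.
    d_max = -0.5 * ((n + 1) / n * math.log(n + 1) + math.log(n) - 2.0 * math.log(2 * n))
    return h, js * h / d_max

print(entropy_complexity([1 / 6] * 6))           # fully random: H near 1, C near 0
print(entropy_complexity([1.0, 0, 0, 0, 0, 0]))  # fully ordered: H = 0, C = 0
```

Both limiting cases give vanishing complexity, while intermediate distributions give C > 0, which is what allows the plane to separate structured from purely random or purely regular series.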

Multifractal detrended fluctuation analysis
Assume that $\{x(t) : t = 1, 2, \ldots, N\}$ is a time series with N data points. The multifractal detrended fluctuation analysis procedure consists of the following steps 20:
1. We determine the profile
$$Y(i) = \sum_{t=1}^{i}\left[x(t) - \langle x \rangle\right], \quad i = 1, \ldots, N, \tag{7}$$
where $\langle x \rangle$ is the average of the time series.
2. The profile Y(i) is divided into $N_s = \mathrm{int}(N/s)$ non-overlapping segments of equal length s. Since N will not always be a multiple of s, a final part of the profile may be left over. To avoid discarding this part of the series, the same procedure is repeated starting from the end, so that we obtain $2N_s$ segments.
3. We calculate the local trend of each of the $2N_s$ segments by a least-squares fit and then determine the variance
$$F^2(v, s) = \frac{1}{s}\sum_{i=1}^{s}\left\{Y[(v-1)s + i] - y_v(i)\right\}^2 \tag{8}$$
for each segment $v = 1, 2, \ldots, N_s$, and
$$F^2(v, s) = \frac{1}{s}\sum_{i=1}^{s}\left\{Y[N - (v - N_s)s + i] - y_v(i)\right\}^2 \tag{9}$$
for each segment $v = N_s + 1, N_s + 2, \ldots, 2N_s$. Here $y_v(i)$ is the fitting polynomial in segment v, chosen based on the time series trend. Polynomials of different orders can be used in the fitting process, giving linear (DFA1), quadratic (DFA2), cubic (DFA3), and higher-order detrending.
4. So far, we have obtained $F^2(v, s)$, which is the variance of each segment v of size s with respect to an arbitrary polynomial. We define the q-th order fluctuation function by averaging over all $2N_s$ segments:
$$F_q(s) = \left\{\frac{1}{2N_s}\sum_{v=1}^{2N_s}\left[F^2(v, s)\right]^{q/2}\right\}^{1/q}. \tag{10}$$
When q = 2, we recover the standard DFA technique. For different values of q, we are interested in how the fluctuation function $F_q(s)$ varies with the length scale s, so we repeat steps 2 through 4 for several values of s.
5. If there is a long-range power-law correlation in the series $x_k$, $F_q(s)$ increases for large values of s as a power law,
$$F_q(s) \sim s^{h(q)}, \tag{11}$$
where h(q) is the generalized Hurst exponent.
A time series is monofractal if the exponent h(q) remains constant regardless of the value of q. On the other hand, if h(q) varies with q, the time series is multifractal. The spectrum of h(q) is determined by the slopes of the log-log plot of $F_q(s)$ versus s for different q values 20,21. The variations in h(q) are examined to assess the impact of scale fluctuations. The difference between the asymptotic values of h(q), denoted as $\Delta h = h(q_{\min}) - h(q_{\max})$, is computed to measure the departure from monofractal behavior. In a monofractal series, $\Delta h = 0$. The magnitude of $\Delta h$ indicates the level of multifractality and dynamical complexity of the time series. See Ref. 41 for a more detailed explanation and calculation of the generalized Hurst exponent.
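The steps above can be sketched in pure Python. To keep the example short, it uses linear detrending (DFA1) with a closed-form least-squares line, whereas the results in this paper were obtained with a second-order fit (DFA2) and a dedicated library. All function names are hypothetical, the case q = 0 is not handled, and scales s of at least 2 points are assumed.

```python
import math
import random

def linear_residual_variance(y):
    """Mean squared residual of a least-squares straight line fitted to y."""
    s = len(y)
    t_mean = (s - 1) / 2.0
    y_mean = sum(y) / s
    stt = sum((t - t_mean) ** 2 for t in range(s))
    sty = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sty / stt
    return sum((v - y_mean - slope * (t - t_mean)) ** 2 for t, v in enumerate(y)) / s

def fluctuation(x, s, q):
    """q-th order fluctuation function F_q(s) with linear detrending (q != 0)."""
    mean = sum(x) / len(x)
    profile, total = [], 0.0
    for value in x:                      # step 1: profile of the series
        total += value - mean
        profile.append(total)
    n_s = len(profile) // s              # step 2: segments from the start...
    segments = [profile[v * s:(v + 1) * s] for v in range(n_s)]
    offset = len(profile) - n_s * s      # ...and the same number from the end
    segments += [profile[offset + v * s:offset + (v + 1) * s] for v in range(n_s)]
    variances = [linear_residual_variance(seg) for seg in segments]  # step 3
    # Step 4: q-th order average over all 2 * n_s segments.
    return (sum(f2 ** (q / 2.0) for f2 in variances) / len(variances)) ** (1.0 / q)

def generalized_hurst(x, scales, q):
    """Step 5: slope of log F_q(s) versus log s, estimated by least squares."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(fluctuation(x, s, q)) for s in scales]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - x_mean) * (b - y_mean) for a, b in zip(xs, ys))
    return num / sum((a - x_mean) ** 2 for a in xs)

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
# For uncorrelated noise the estimate is expected to lie close to 0.5.
print(generalized_hurst(noise, [16, 32, 64, 128, 256], q=2))
```

Repeating the last call for several values of q gives an estimate of the spectrum h(q); a roughly constant h(q) points to monofractal behavior.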
The MF-DFA technique is unsuitable for strongly anti-correlated series where h(q) approaches zero, as it only yields positive generalized Hurst exponents. To address this issue, a modified MF-DFA approach has been recommended 20, in which the single summation in Eq. (7) is replaced by a double summation,
$$\tilde{Y}(i) = \sum_{k=1}^{i}\left[Y(k) - \langle Y \rangle\right]. \tag{12}$$
Following the MF-DFA procedure as described above, we obtain generalized fluctuation functions $\tilde{F}_q(s)$ described by a scaling law as in Eq. (11), but with larger exponents $\tilde{h}(q) = h(q) + 1$:
$$\tilde{F}_q(s) \sim s^{\tilde{h}(q)} = s^{h(q)+1}. \tag{13}$$
Thus, the scaling behavior can be accurately determined even if h(q) is less than zero for some values of q. The dependence on q in the multifractal case can be described by the multifractal scaling exponent
$$\tau(q) = q\,h(q) - 1, \tag{14}$$
which depends on the generalized Hurst exponent h(q). The more nonlinear the dependence of $\tau(q)$ on q, the stronger the multifractality.
The multifractal spectrum $(\alpha, f(\alpha))$, which is related to the multifractal scaling exponent $\tau(q)$ through a first-order Legendre transformation 42,43, is another way to represent the multifractality of a time series. If $\tau(q)$ is sufficiently smooth, the singularity strength $\alpha$ is given by
$$\alpha = \tau'(q) = \frac{d\tau(q)}{dq}, \tag{15}$$
from which the singularity spectrum $f(\alpha)$ can be constructed:
$$f(\alpha) = q\alpha - \tau(q). \tag{16}$$
The graph of $f(\alpha)$ versus $\alpha$, also known as the multifractal spectrum or spectrum of singularities, reflects the properties of the profile of h(q). The exponent $\alpha$ reveals the differences in scaling exponents, and the spread of the singularity strength $\alpha$ around the dominant exponent h is larger for time series with stronger multifractality. The function $f(\alpha)$ reaches its maximum value when q = 0, with $\max f(\alpha) = 1$. In a monofractal series, where $\alpha = \tau'(q) = H$, the set of points representing $f(\alpha)$ collapses to a single point.
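The Legendre transformation can be illustrated numerically. The sketch below assumes the standard MF-DFA relation $\tau(q) = q\,h(q) - 1$ and approximates $\tau'(q)$ by finite differences on the grid of q values; the function name and the differencing scheme are choices made for this example and are not taken from the library used in the paper.

```python
def multifractal_spectrum(qs, hs):
    """Return the lists (tau, alpha, f) from generalized Hurst exponents h(q).

    qs must be strictly increasing and contain at least two values.
    """
    tau = [q * h - 1.0 for q, h in zip(qs, hs)]
    alpha = []
    for i in range(len(qs)):
        lo = max(i - 1, 0)
        hi = min(i + 1, len(qs) - 1)
        # Central difference inside the grid, one-sided at the two ends.
        alpha.append((tau[hi] - tau[lo]) / (qs[hi] - qs[lo]))
    f = [q * a - t for q, a, t in zip(qs, alpha, tau)]
    return tau, alpha, f

# Monofractal check: a constant h(q) collapses the spectrum to one point.
tau, alpha, f = multifractal_spectrum([-2, -1, 0, 1, 2], [0.7] * 5)
print(alpha, f)  # every alpha is close to 0.7 and every f close to 1
```

For a multifractal series, the width $\alpha_{\max} - \alpha_{\min}$ of the resulting curve plays the role of $\Delta\alpha$ reported in the results.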
We also define the symmetry parameter B of the multifractal spectrum, for which B = 1 corresponds to a symmetric spectrum. When B > 1, subsets exhibiting minor fluctuations generally have a more pronounced impact on the multifractal spectrum, suggesting a right-skewed spectrum. Conversely, if B < 1, the multifractal spectrum skews toward the left, with the larger fluctuations tending to exert a greater influence on it. See Refs. 20,41,44 for a thorough evaluation of the significance and interpretation of the generalized Hurst exponents.

Results and discussion
Maize and soybean nucleotide sequences are available from the National Center for Biotechnology Information (NCBI) 32. We applied the analysis tools to the complete sequences of the 10 chromosomes of maize and the 20 chromosomes of soybean.

Chaos game representation
We obtained chaos game representations for all 30 chromosomes at different scales k, using the code available in Ref. 33. This representation allows us to visualize repetition patterns in nucleotide sequences, which appear as geometric patterns like parallel lines, squares, rectangles, and triangles. The abundance of nucleotide sequences is reflected in the gray level, so that the more abundant a k-mer, the darker the quadrant that represents it. The CGR image can also reveal the overall base composition of the DNA sequence, since different regions of the image correspond to different nucleotide frequencies. In Fig. 2, we present the frequency of 3-mers, 5-mers, and 6-mers for the randomly chosen chromosomes 2 and 5 of maize. These results correspond to the degrees of pixelation k = 3, 5 and 6, respectively. At these degrees of pixelation, all possible combinations of nucleotide sequences are displayed. In Fig. 3, we present the results at the same scales k = 3, 5 and 6 for soybean chromosomes 2 and 5. The other chromosomes present patterns similar to those shown. Visually, the images generated by the soybean sequences appear to have a more explicit fractal behavior, with better-defined geometric patterns.
By generating CGR for all 30 chromosomes using various scales, we identified a range of fractal shapes, including parallel lines, squares, rectangles, and intricate fractal structures.This discovery highlights the underlying principles that govern the arrangement of nucleotides and opens up new ways for understanding the functional and evolutionary aspects of the genome.We can see that the distribution of degrees of gray has a behavior that is not random for both species.
When a Chaos Game Representation (CGR) image displays global patterns of squares and parallel lines, it suggests the presence of specific structural elements or motifs within the DNA sequence. The squares observed in the CGR image indicate regions of the sequence that exhibit repetitive patterns. These squares represent areas where specific nucleotide sequences or structural elements occur repeatedly. Moreover, the presence of parallel
the distribution of trimers in chromosome sequences is related. In this sense, the species of maize and soybean are very similar.

Time series and ordinal patterns
This construction step of time series from chromosome sequences is essential for applying the MF-DFA and ordinal patterns methods.We use the f mapping rule defined in the "Time series" section.We reinforce that, as different mapping rules can be made to transform a sequence of symbols (DNA sequence) into a time series, we can obtain different fractal parameters that characterize the data.However, as we apply the same rule to both species, we can obtain important information by comparing the obtained fractal parameters.The main statistical characteristics of the resulting time series are shown in Table 2 and the time series representations, for some chromosomes chosen randomly, are shown in Fig. 4 for both species.We can observe in the graphs of the walks that positive values tend to appear, indicating the concentration of A and G in the nucleotide sequences, as defined by our mapping rule.
We calculated the entropy H and complexity C of the time series generated for all maize and soybean chromosomes. We divide the time series into $n_x$ partitions of size $d_x = 3$ with embedding delay $\tau_x = 1$. We use the ordpy library introduced in Ref. 38 and available in Ref. 45.
We plot the values obtained in the complexity-entropy plane for each chromosome; see Fig. 5. For soybean, the entropy values are around H ≈ 0.855 and the complexity around C ≈ 0.145. For maize, H ≈ 0.906 and C ≈ 0.091. Both series have high entropy and low complexity, indicating stochastic process characteristics. As the entropy H for maize is greater than for soybean, there is more genetic information for maize than for soybean. Moreover, we can also say that the time series generated by maize has more unpredictable patterns; that is, it is more random than the soybean series. This translates into a more blurred fractal pattern in the CGR of Fig. 2. On the other hand, soybean is more complex than maize, i.e., it has higher complexity C. The statistical complexity quantifies the existence of non-trivial structures. In the cases of perfect order and total randomness, C = 0, meaning the data possess no structure. Between these two extremes, a wide range of possible values quantifies the level of structure in the data. The statistical complexity can detect subtle details of the dynamical processes that generate the data. In this sense, we can say that soybean has a more complex structure than maize. This result is corroborated by the CGR, where soybean has a more evident fractal structure than maize.
In the context of maize and soybean DNA sequences, it is critical to consider the C-value paradox, especially given the significant disparity in genome sizes between the two species.The "C-value paradox" is a term used in biology to describe the apparent disconnect between genome size and organism complexity 46,47 .Although maize has a much larger number of base pairs, this quantity does not translate into a more organized genomic structure and greater complexity of the organism.One possible explanation is that the soybean genome may have a relatively lower proportion of repeated sequences and mobile genetic elements compared to corn, which contributes to a clearer organization and more uniform genomic structure.Furthermore, soybeans may have undergone processes that favored genome compaction and the elimination of unnecessary or redundant sequences, resulting in a more efficient and cohesive organization of DNA.

MF-DFA analysis
We also applied the MF-DFA analysis to all 30 chromosomes. We use a Python library for MF-DFA introduced in Ref. 48 and available on GitHub 49. We determine the generalized exponents and the multifractal spectra. To obtain these results, we use a second-order polynomial fit (DFA2) over segment sizes s in the interval (100, 4,000,000) with step 1000.
For comparison, we show two other artificial sequences: a periodic sequence constructed by repeating the letters "ATGC" 7,500 times, and another sequence of 30,000 base pairs made of the letters "A", "T", "G", and "C" randomly distributed. We made this comparison because these artificial time series present interesting behavior: the periodic series does not present a fractal pattern, and therefore its fluctuation function is independent of q, while the random series presents a weak correlation between the nucleotides.
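The two artificial reference sequences are simple to generate. The sketch below is one possible way to do so; the function names, the use of a fixed seed, and the uniform choice among the four letters are assumptions of this example, since the text does not state how the random sequence was drawn.

```python
import random

def periodic_sequence(repeats=7500, unit="ATGC"):
    """Periodic reference sequence: the unit 'ATGC' repeated 7,500 times."""
    return unit * repeats

def random_sequence(length=30000, alphabet="ATGC", seed=0):
    """Random reference sequence with letters drawn uniformly (assumed)."""
    rng = random.Random(seed)  # fixed seed only to make the example reproducible
    return "".join(rng.choice(alphabet) for _ in range(length))

print(len(periodic_sequence()), len(random_sequence()))  # both have 30,000 bases
```

Both sequences have the same length, so they can be passed through the same mapping and MF-DFA pipeline as the chromosome sequences.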
For the random sequence, one gets H ≈ 0.5, which reveals a weakly correlated nucleotide sequence, as expected for a random sequence. For the periodic sequence, h(q) = 0 for all values of q (grey), which reveals a non-fractal behavior. As seen in Fig. 6, for some chromosomal sequences, one obtains H ≳ 0.97, indicating that fluctuations in base pair sequences exhibit a highly persistent nature. The other chromosomes present the same behavior, and the values of the Hurst exponent for each one are shown in Table 2. Persistence means that positive values of the time series tend to be followed by positive values (long-range correlation). That is, when one of the bases adenine or guanine occurs, there is a tendency for these nitrogenous bases to continue appearing over a long stretch; the same holds for the non-occurrence of these bases.
The h(q) spectra for all chromosomal sequences show relatively small variation with q; see Δh in Table 2. The width of the h(q) plot can give insights into the degree of multifractality in a time series. A narrow width suggests a weak correlation between different scales of the time series, that is, a simple fractal structure that a small number of scaling factors can describe. On the other hand, a broad width indicates a strong correlation between different scales. On average, we obtained ⟨Δh⟩ = 0.369 for maize and ⟨Δh⟩ = 0.2915 for soybean, indicating that maize has a more heterogeneous sequence than soybean, characterized by a well-defined multifractal structure with a long-range power-law correlation between nucleotides and a relatively larger number of scaling factors.
The multifractal spectra obtained from Eq. (16) show concave behavior for all the curves, with maxima at scaling indices α = h(2); see Fig. 7 and Table 2. For the periodic sequence, the spectrum degenerates to a single point. The width of f(α) is a measure of the degree of multifractality: the greater the width, the more heterogeneous the fractal, i.e., the greater the complexity of the generating process of the analyzed series and the greater the difficulty in making predictions. On average, we obtained ⟨Δα⟩ = 0.495 for maize and ⟨Δα⟩ = 0.397 for soybean. In this sense, maize has a greater mean variation, indicating that it has a more complex generating process and that predictions about its time series are more difficult.
Parameter B is greater than 1 for most maize and soybean chromosomes. In this sense, we noticed that the soybean chromosomes present significant asymmetry. The left asymmetry indicates that the time series has higher complexity and variability at smaller scales, with fluctuations becoming less significant as the scale increases. On the other hand, chromosomes 4, 5 and 6 of maize show right asymmetry, indicating that larger fluctuations in chromosome sequences contribute more strongly to the multifractal spectrum.
The MF-DFA method is a powerful, robust, well-known, widely used and easily applicable multifractal analysis tool. In addition, we highlight other analysis approaches that can be used to study plant sequences, such as multifractal detrended cross-correlation analysis and the wavelet transform modulus maxima (WTMM) method and its variants 50,51. These approaches offer distinct and complementary perspectives on the (multi)fractal characteristics of plant genetic sequences. Combining and comparing these methods provides a more complete and robust understanding of the temporal dynamics of the systems studied, allowing deeper insights into their complexity and emergent behavior over time.

Conclusion
We applied the chaos game representation, ordinal patterns, and MF-DFA approaches to study the characteristics of maize and soybean sequences. We investigated structural properties across multiple scales using these methods. The information obtained from this analysis helps classify and characterize genomic data. Through these approaches, it was demonstrated that:
• Through the Chaos Game Representation (CGR) method, we analyzed a set of DNA and protein sequences and generated fractal-like images that revealed unique patterns and features of the input sequences. The results from this method indicate that soybean sequences have a more clearly defined fractal structure than maize sequences.
• This complexity in the soybean structure is also detected through the complexity measure C.
• CGR reveals the presence of power-law correlations at different scales for the DNA sequences. This result is corroborated by the values of the Hurst exponent H, which also indicate the persistent nature of the time series.
• Calculating the distance parameter d between all chromosomes, we conclude that the base pair sequences of the two species show high similarity.
• The mapping of the base pairs of the sequences into numerical values revealed a greater concentration of the adenine and guanine bases in both species.
• The permutation entropy indicates that the maize sequence is more random than the soybean sequence.
• Through the MF-DFA approach, we observe that, on average, the chromosomes of maize have a more complex multifractal structure than the chromosomes of soybean; that is, more scaling factors are needed to characterize the maize sequence than the soybean sequence.
• The maize sequence presents a high degree of heterogeneity, characterized by a more complex generating process and more difficult prediction than for the time series generated from soybean.
• The pronounced left asymmetry of the soybean sequences indicates that the time series has greater complexity and variability on small scales than the series generated by maize.
• The complexity-entropy plane reveals that both time series have stochastic process characteristics.
• In summary, soybean sequences have a more complex and less random structure than maize sequences. This complexity is translated into a better-defined fractal structure. Maize, on the other hand, has a more random and less complex structure.
Despite these important and promising results, we emphasize the need to connect these findings with biological meaning. The frequencies of k-mers may have implications for the occurrence of proteins in these plants. Furthermore, the MF-DFA analysis can have a lot to say about the mutations that these plants undergo over time. Therefore, a deeper approach that connects these results could be a promising next step. Additionally, we stress the significance of conducting further research with species that are more closely related in phylogeny and genome size, as this is essential for extending and verifying the findings obtained thus far. These supplementary investigations will enable a deeper comprehension of the connections between genomic structures and provide context for the present findings. We aim to enhance our understanding of the fractal and complexity characteristics of the genomic sequences in these plants by integrating these supplementary investigations.

Figure 6. The generalized Hurst exponents h(q) for randomly chosen maize (left) and soybean (right) chromosomes. The vertical black line at q = 2 helps to visualize the values h(2). The same behavior is observed in the other chromosomes.

Figure 1. Quadrants in FCGR at different pixelation levels k. In the top row, each quadrant uniquely corresponds to a specific string of length k: k = 1 (left column), k = 2 (middle column) and k = 3 (right column). The middle row shows the FCGR of the "ACGT" sequence at the different scales k. The bottom row shows the FCGR of the "TTCA" sequence.
https://doi.org/10.1038/s41598-024-60722-2
lines in the CGR image indicates the presence of periodic or alternating patterns within the DNA sequence. These lines can signify regions where the DNA sequence exhibits a periodicity or a repeated pattern of nucleotides or base compositions.

Figure 2. FCGR for randomly chosen maize chromosomes. The first column shows the results for chromosome 2 and the second for chromosome 5. Each row corresponds to a different scale k. Top: k = 3, middle: k = 5 and bottom: k = 6. All maize chromosomes exhibit similar FCGR behavior.

Figure 3. FCGR for soybean chromosomes. The first column shows the results for chromosome 2 and the second for chromosome 5. Each row corresponds to a different scale k. Top: k = 3, middle: k = 5 and bottom: k = 6. All soybean chromosomes exhibit similar FCGR behavior.

Figure 4. Time series representation of maize (top) and soybean (bottom) chromosomes, as described in the "Time series" section. Note that all walks tend to go to higher values, meaning a high concentration of bases A and G.

Figure 5. Complexity-entropy plane for maize and soybean chromosomes. Continuous lines represent the minimum ($C_{\min}$) and maximum ($C_{\max}$) complexities. We zoomed in on the region to better visualize the points.

Figure 7. MF-DFA analysis of DNA sequences. The f(α) spectra versus scaling indices α for maize (left) and soybean (right) DNA sequences.
that the importance of each quadrant is proportional to the frequency of the oligomer it represents. The distance d between two DNA sequences is defined by
$$d = 1 - rw_{p,q}$$
and has a value between 0 and 2. Values close to zero correspond to near-exact similarity between sequences, and values greater than one correspond to negative correlation coefficients between sequences. The value of d is specific to the resolution of the frequency decomposition (FCGR) being used.

Table 2. Main statistical characteristics of the time series generated by the sequences of base pairs of maize and soybean chromosomes. The first to sixth columns indicate the chromosome, the size of each sample, the maximum and minimum values, and the mean and variance of the samples, respectively. We also present the main fractal measures: the seventh column contains the Hurst exponent H; the eighth, ninth, and tenth columns contain, respectively, the variations $\Delta h = h_{\max} - h_{\min}$ and $\Delta\alpha = \alpha_{\max} - \alpha_{\min}$, and the symmetry parameter B.